Your final is due by the end of the last week of class. You should post your solutions to your GitHub account or RPubs. You are also expected to make a short presentation via YouTube and post that recording to the board. This project will show off your ability to understand the elements of the class.
Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean of \(\mu=\sigma=\frac{(N+1)}{2}\)
Following the instructions from above, here is my R syntax to generate a random variable X:
set.seed(123)
N <- 10
X <- runif(10000, 1, N)
Y <- rnorm(10000, (N+1)/2, (N+1)/2)
From the above syntax, you can see that I’ve created a variable \(X\) with 10,000 random uniform numbers from 1 to \(N\) (\(N=10\)). Additionally, I’ve created a random variable \(Y\) that has 10,000 random normal numbers with a \(\mu\) and \(\sigma\) of \(\frac{N+1}{2}\).
Probability: Calculate as a minimum the below probabilities a through c. Assume the small letter \(x\) is estimated as the median of the \(X\) variable, and the small letter \(y\) is estimated as the 1st quartile of the \(Y\) variable. Interpret the meaning of all probabilities.
To work through this, I’ll first have to calculate \(x\) (the median) and \(y\) (the first quartile). Additionally, I’ve saved \(X\) and \(Y\) in a dataframe can stored the total number of rows (10,000) in a variable:
x <- quantile(X, 0.50)
y <- quantile(Y, 0.25)
df <- data.frame(X = X, Y = Y)
total_rows <- nrow(df)
a) \(P(X>x \ | \ X>y)\)
We can use the following equation to find the probability:
\[P(A \ | \ B) = \frac{P(A \ \cap \ B)}{P(B)} \\ where \ A = P(X>x) \ and \ B = P(X>y)\]
With this in mind, we can solve:
P_AandB <- nrow(df %>% filter(X > x & X > y)) / total_rows
P_B <- nrow(df %>% filter(X > y)) / total_rows
P_AgivenB <- P_AandB / P_B
P_AgivenB
## [1] 0.5509642
Solution: The probability \(P(X>x \ | \ X>y) = 0.5510\), which means that the probability that a uniform number from 1 to 10 is greater than the median of 5.45 given this number is greater than 1.84 is 0.5510.
b) \(P(X>x, \ Y>y)\)
We can use the following equation to solve this portion:
\[P(A,B)=P(A) \times P(B) \\ where \ A=N(X>x) \ and \ B=N(Y>y)\]
With this in mind, we can solve:
P_AandB_2 <- nrow(df %>% filter(X > x & Y > y)) / total_rows
P_AandB_2
## [1] 0.3756
Solution: The probability \(P(X>x, \ Y>y) = 0.3756\), which means that the probability that a uniform number from 1 to 10 is greater than the median of 5.45 and this number is greater than 1.84 is 0.3756.
c) \(P(X<x, \ X>y)\)
Similar to part a, we can use the following equation to solve:
\[P(A \ | \ B) = \frac{P(A \ \cap \ B)}{P(B)} \\ where \ A = P(X<x) \ and \ B = P(X>y)\]
With this in mind, we can solve:
P_AandB <- nrow(df %>% filter(X < x & X > y)) / total_rows
P_B <- nrow(df %>% filter(X > y)) / total_rows
P_AgivenB_2 <- P_AandB / P_B
P_AgivenB_2
## [1] 0.4490358
Solution: The probability \(P(X<x \ | \ X>y) = 0.4490\), which means that the probability that a uniform number from 1 to 10 is less than the median of 5.45 given this number is greater than 1.84 is 0.4490.
Investigate whether \(P(X>x \ and \ Y>y)=P(X>x)P(Y>y)\) by building a table and evaluating the marginal and joint probabilities.
To do this investigation, we’ll first have to calculate the joint probabilities:
We’ll calculate joint probabilities by using the equation:
\[P(A \cap B) = P(A \times B)\] Therefore, we first will create a few columns to distinguish whether our values meet the following conditions:
Then, we can take the count of each condition, and multiply the four conditions to get their respective joint probabilities:
Below is the R syntax I used to do this, by using group_by() and summarise() functions to do the calculations with the tidyverse package:
# Joint probabilities
prob_df <- df %>%
mutate(A = ifelse(X > x, " X > x", " X < x"), B = ifelse(Y > y, " Y > y", " Y < y")) %>%
group_by(A, B) %>%
summarise(count = n()) %>%
mutate(prob = count / total_rows)
Next, we need to back up a bit and examine the marginal probabilities of \(P(A)\) and \(P(B)\) independently:
Intuitively we know that \(x = median(X)\), therefore, half of the values of \(X\) > \(x\) and the other half of \(X\) < \(x\). Similarly, since we know that \(y = first \ quartile \ of \ Y\), a quarter of \(Y\) < \(y\) and the remaining three fourths of the values of \(Y\) > \(y\). Therefore, the marginal probabilities are:
We can check these assumptions by reworking our table:
prob_df <- prob_df %>%
ungroup() %>%
group_by(A) %>%
summarise(count = sum(count), prob = sum(prob)) %>%
mutate(B = "Total Prob") %>%
bind_rows(prob_df)
prob_df <- prob_df %>%
ungroup() %>%
group_by(B) %>%
summarise(count = sum(count), prob = sum(prob)) %>%
mutate(A = "Total Prob") %>%
bind_rows(prob_df)
And by printing our table, we can see that our marginal probabilities that we assumed are confirmed:
prob_df %>%
select(-count) %>%
spread(A, prob) %>%
rename("Conditions" = B) %>%
kable() %>%
kable_styling(bootstrap_options = c('striped'), full_width = FALSE)
| Conditions | X < x | X > x | Total Prob |
|---|---|---|---|
| Y < y | 0.1256 | 0.1244 | 0.25 |
| Y > y | 0.3744 | 0.3756 | 0.75 |
| Total Prob | 0.5000 | 0.5000 | 1.00 |
Solution: After generating the table above, it appears that \(P(X>x \ and \ Y>y)\) does equal \(P(X>x)P(Y>y)\), since they both have a value of 0.3756. By doing the calculations to check, \(P(X>x)P(Y>y) = (0.50)(0.75) = 0.375\) and \(P(X>x \ and \ Y>y) = 0.375\) as well.
Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
In order to run our Fisher’s Exact Test and Chi Square Test, we’ll first need to find the counts of each our conditions, similar to the previous question and create a two-by-two table:
X_plus <- nrow(df %>%
filter(X > x))
X_lesseq <- nrow(df %>%
filter(X <= x))
Y_lesseq <- nrow(df %>%
filter(Y <= y))
Y_plus <- nrow(df %>%
filter(Y > y))
freq_matrix <- matrix(c(X_plus, X_lesseq, Y_plus, Y_lesseq),
nrow = 2, ncol = 2, byrow = TRUE,
dimnames = list(c("x", "y"),
c("X > x; Y > y", "X <= x; Y <= y")))
With our frequency table ready to go, we can see our counts below:
freq_matrix %>%
kable() %>%
kable_styling(bootstrap_options = c('striped'), full_width = FALSE)
| X > x; Y > y | X <= x; Y <= y | |
|---|---|---|
| x | 5000 | 5000 |
| y | 7500 | 2500 |
Now, we are ready to run our Fisher’s Exact Test:
fisher.test(freq_matrix)
##
## Fisher's Exact Test for Count Data
##
## data: freq_matrix
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.3137986 0.3540659
## sample estimates:
## odds ratio
## 0.3333333
And here is the Chi-Square Test:
chisq.test(freq_matrix)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: freq_matrix
## X-squared = 1332.3, df = 1, p-value < 2.2e-16
When looking at both of the outputs, we can see that we reject the null hypothesise for both, given that the p-value is less than 0.05 for the 95 percent confidence interval. Since the Fisher’s Exact Test is typically used for smaller sample sizes, and our sample is of 10,000 observations, it would be more appropriate to use the Chi-Square Test.
Solution:
Our \(H_0\) = There is no relationship between X and Y.
Our \(H_a\) = There is a significant relationship between X and Y.
Given our outputs, independence does not hold when we run the Fisher’s Exact Test and Chi-Square test. We reject the null hypothesis that there is no relationship between X and Y. Since the Fisher’s Exact Test is typically used on smaller sample sizes, and we have a large sample size of 10,000 observations, it would be more appropriate to rely on the output from our Chi-Square Test.
You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition: https://www.kaggle.com/c/house-prices-advanced-regression-techniques. I want you to do the following.
Descriptive and Inferential Statistics: Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
kaggle_df <- read_csv('https://raw.githubusercontent.com/zachalexander/data605_cuny/master/Final%20Exam/train.csv')
# summary(kaggle_df)
On a quick look at the summary, I decided to pull out the numeric values to explore a bit more:
numerics <- data.frame(summary(kaggle_df))
numerics <- numerics %>%
filter(!is.na(Freq))
numerics <- data.frame(numerics %>%
filter(!grepl('Class', Freq), !grepl('Length', Freq), !grepl('Mode', Freq)) %>%
separate(Freq, c('Value', 'Count'), sep = ":", remove = TRUE))
numerics$Var2 <- trimws(numerics$Var2)
numerics$Value <- trimws(numerics$Value)
numerics$Count <- round(as.numeric(trimws(numerics$Count)), 1)
numerics <- numerics %>%
select(Var2, Value, Count) %>%
spread(Value, Count) %>%
rename("Attribute" = "Var2", "Q1" = `1st Qu.`, "Q3" = `3rd Qu.`, "Max" = "Max.", "Min" = "Min.", "NA" = "NA's") %>%
select(Attribute, Min, Q1, Median, Mean, Q3, Max, `NA`)
kable(numerics) %>%
kable_styling(bootstrap_options = 'striped')
| Attribute | Min | Q1 | Median | Mean | Q3 | Max | NA |
|---|---|---|---|---|---|---|---|
| 1stFlrSF | 334 | 882.0 | 1087.0 | 1163.0 | 1391.0 | 4692 | NA |
| 2ndFlrSF | 0 | 0.0 | 0.0 | 347.0 | 728.0 | 2065 | NA |
| 3SsnPorch | 0 | 0.0 | 0.0 | 3.4 | 0.0 | 508 | NA |
| BedroomAbvGr | 0 | 2.0 | 3.0 | 2.9 | 3.0 | 8 | NA |
| BsmtFinSF1 | 0 | 0.0 | 383.5 | 443.6 | 712.2 | 5644 | NA |
| BsmtFinSF2 | 0 | 0.0 | 0.0 | 46.5 | 0.0 | 1474 | NA |
| BsmtFullBath | 0 | 0.0 | 0.0 | 0.4 | 1.0 | 3 | NA |
| BsmtHalfBath | 0 | 0.0 | 0.0 | 0.1 | 0.0 | 2 | NA |
| BsmtUnfSF | 0 | 223.0 | 477.5 | 567.2 | 808.0 | 2336 | NA |
| EnclosedPorch | 0 | 0.0 | 0.0 | 21.9 | 0.0 | 552 | NA |
| Fireplaces | 0 | 0.0 | 1.0 | 0.6 | 1.0 | 3 | NA |
| FullBath | 0 | 1.0 | 2.0 | 1.6 | 2.0 | 3 | NA |
| GarageArea | 0 | 334.5 | 480.0 | 473.0 | 576.0 | 1418 | NA |
| GarageCars | 0 | 1.0 | 2.0 | 1.8 | 2.0 | 4 | NA |
| GarageYrBlt | 1900 | 1961.0 | 1980.0 | 1979.0 | 2002.0 | 2010 | 81 |
| GrLivArea | 334 | 1130.0 | 1464.0 | 1515.0 | 1777.0 | 5642 | NA |
| HalfBath | 0 | 0.0 | 0.0 | 0.4 | 1.0 | 2 | NA |
| Id | 1 | 365.8 | 730.5 | 730.5 | 1095.2 | 1460 | NA |
| KitchenAbvGr | 0 | 1.0 | 1.0 | 1.0 | 1.0 | 3 | NA |
| LotArea | 1300 | 7554.0 | 9478.0 | 10517.0 | 11602.0 | 215245 | NA |
| LotFrontage | 21 | 59.0 | 69.0 | 70.0 | 80.0 | 313 | 259 |
| LowQualFinSF | 0 | 0.0 | 0.0 | 5.8 | 0.0 | 572 | NA |
| MasVnrArea | 0 | 0.0 | 0.0 | 103.7 | 166.0 | 1600 | 8 |
| MiscVal | 0 | 0.0 | 0.0 | 43.5 | 0.0 | 15500 | NA |
| MoSold | 1 | 5.0 | 6.0 | 6.3 | 8.0 | 12 | NA |
| MSSubClass | 20 | 20.0 | 50.0 | 56.9 | 70.0 | 190 | NA |
| OpenPorchSF | 0 | 0.0 | 25.0 | 46.7 | 68.0 | 547 | NA |
| OverallCond | 1 | 5.0 | 5.0 | 5.6 | 6.0 | 9 | NA |
| OverallQual | 1 | 5.0 | 6.0 | 6.1 | 7.0 | 10 | NA |
| PoolArea | 0 | 0.0 | 0.0 | 2.8 | 0.0 | 738 | NA |
| SalePrice | 34900 | 129975.0 | 163000.0 | 180921.0 | 214000.0 | 755000 | NA |
| ScreenPorch | 0 | 0.0 | 0.0 | 15.1 | 0.0 | 480 | NA |
| TotalBsmtSF | 0 | 795.8 | 991.5 | 1057.4 | 1298.2 | 6110 | NA |
| TotRmsAbvGrd | 2 | 5.0 | 6.0 | 6.5 | 7.0 | 14 | NA |
| WoodDeckSF | 0 | 0.0 | 0.0 | 94.2 | 168.0 | 857 | NA |
| YearBuilt | 1872 | 1954.0 | 1973.0 | 1971.0 | 2000.0 | 2010 | NA |
| YearRemodAdd | 1950 | 1967.0 | 1994.0 | 1985.0 | 2004.0 | 2010 | NA |
| YrSold | 2006 | 2007.0 | 2008.0 | 2008.0 | 2009.0 | 2010 | NA |
With my numerics dataframe ready, I can now set up some visuals by splitting the dataframe into two, one with all quantative variables, and the other with categorical data:
kaggle_numerics <- kaggle_df[, (colnames(kaggle_df)) %in% numerics$Attribute]
kaggle_categorical <- kaggle_df[, !(colnames(kaggle_df) %in% numerics$Attribute)]
kaggle_numerics %>%
dplyr::select(2:11) %>%
pairs.panels(method = "pearson", hist.col = "#8bb397")
kaggle_numerics %>%
dplyr::select(12:21) %>%
pairs.panels(method = "pearson", hist.col = "#8bb397")
kaggle_numerics %>%
dplyr::select(22:31) %>%
pairs.panels(method = "pearson", hist.col = "#8bb397")
kaggle_numerics %>%
dplyr::select(32:38) %>%
pairs.panels(method = "pearson", hist.col = "#8bb397")
After creating a few plot matrices with our quantitative data, I thought there were a few initial takeaways:
Many of the living spaces, and the square footage values seem to show stronger correlations with one another. This makes sense given that many times, the square footage of one floor will likely be similar square footage to another floor in the house. For instance, there’s a strong correlation between the first floor square footage and the basement square footage. When thinking about our regression later, it’s nice to know that there are multiple attributes here that could serve as proxies (if needed).
Additionally, I’m seeing a gradual, positive trend towards larger areas of garages the later they are built in the 20th century – for instance, garages built in the early part of the 20th century seem to be smaller than those that are built in the later 1900s and early 2000s. This may be something to look into further, and it’s relationship with the Sale Price later on.
It’s interesting to see too that the year the house is built has a relationship with the rating of overall material and finish quality of the house. Again, this make sense, but it would be valueable to see if this has a relationship with Sale Price.
Now, let’s explore the categorical variables a bit.
table(kaggle_categorical$LandSlope)
##
## Gtl Mod Sev
## 1382 65 13
# table(kaggle_categorical$BldgType)
table(kaggle_categorical$HouseStyle)
##
## 1.5Fin 1.5Unf 1Story 2.5Fin 2.5Unf 2Story SFoyer SLvl
## 154 14 726 8 11 445 37 65
# table(kaggle_categorical$RoofStyle)
table(kaggle_categorical$RoofMatl)
##
## ClyTile CompShg Membran Metal Roll Tar&Grv WdShake WdShngl
## 1 1434 1 1 1 11 5 6
# table(kaggle_categorical$ExterCond)
table(kaggle_categorical$CentralAir)
##
## N Y
## 95 1365
# table(kaggle_categorical$Utilities)
table(kaggle_categorical$HeatingQC)
##
## Ex Fa Gd Po TA
## 741 49 241 1 428
table(kaggle_categorical$SaleType)
##
## COD Con ConLD ConLI ConLw CWD New Oth WD
## 43 2 9 5 5 4 122 3 1267
table(kaggle_categorical$SaleCondition)
##
## Abnorml AdjLand Alloca Family Normal Partial
## 101 4 12 20 1198 125
table(kaggle_categorical$Foundation)
##
## BrkTil CBlock PConc Slab Stone Wood
## 146 634 647 24 6 3
After looking through a large number of categorical values, and checking frequencies, a few things stood out to me:
It looks like there’s a majority of houses that are built on gentle slopes, however, there’s a proportion that are built on moderate or severe slopes – this could be something to look into further when building out the regression later on. It may be interesting to see if there is any correlation here with the Sale Price.
Similarly, the foundation material seems to be split heavily between cinder blocks and poured concrete. This is another feature that may be useful to explore later to see if this has an affect on housing prices, or if there is a connection between the year a house was built and which type of foundation was used.
Finally, Sale Type and Sale Condition were interesting values and their frequencies suggest that there isn’t a “one-size-fits-all” approach to certain transactions, making this an enticing set of features to explore further.
After plotting all of my quantitative variables earlier, I thought I’d hone in on a few that may be useful to explore their relationship with the Sale Price variable. You can find my plot below:
pairs(kaggle_df[, c('1stFlrSF', 'GrLivArea', 'GarageArea', 'TotalBsmtSF', 'YearBuilt', 'SalePrice' )], pch = 19)
For our correlation matrix, I will use GrLivArea, 1stFlrSF, and SalePrice. And taking a few of these same attributes, I thought it would be interesting to plot these in a correlation matrix:
kaggle_numerics %>%
dplyr::select(`1stFlrSF`, `GrLivArea`, SalePrice) %>%
pairs.panels(method = "pearson", hist.col = "#8bb397")
corr_matrix <- kaggle_numerics %>%
select(`1stFlrSF`, GrLivArea, SalePrice) %>%
cor() %>%
as.matrix()
As we can see from above, all three attributes, the square footage of the first floor, the above grade (ground) living area and the Sale Price all seem to have a positive relationship with one another. All three distributions also appear to be unimodal and right-skewed. Below is the official correlation matrix:
kable(corr_matrix) %>%
kable_styling(bootstrap_options = 'striped', full_width = FALSE)
| 1stFlrSF | GrLivArea | SalePrice | |
|---|---|---|---|
| 1stFlrSF | 1.0000000 | 0.5660240 | 0.6058522 |
| GrLivArea | 0.5660240 | 1.0000000 | 0.7086245 |
| SalePrice | 0.6058522 | 0.7086245 | 1.0000000 |
Now, although these correlations seem to be quite strong, we can do some hypothesis testing on their relationships at an 80-percent confidence interval:
corr1 <- cor.test(kaggle_numerics$SalePrice, kaggle_numerics$`1stFlrSF`, method = 'pearson', conf.level = 0.80)
corr2 <- cor.test(kaggle_numerics$SalePrice, kaggle_numerics$GrLivArea, method = 'pearson', conf.level = 0.80)
corr3 <- cor.test(kaggle_numerics$`1stFlrSF`, kaggle_numerics$GrLivArea, method = 'pearson', conf.level = 0.80)
corr1
##
## Pearson's product-moment correlation
##
## data: kaggle_numerics$SalePrice and kaggle_numerics$`1stFlrSF`
## t = 29.078, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.5841687 0.6266715
## sample estimates:
## cor
## 0.6058522
corr2
##
## Pearson's product-moment correlation
##
## data: kaggle_numerics$SalePrice and kaggle_numerics$GrLivArea
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6915087 0.7249450
## sample estimates:
## cor
## 0.7086245
corr3
##
## Pearson's product-moment correlation
##
## data: kaggle_numerics$`1stFlrSF` and kaggle_numerics$GrLivArea
## t = 26.217, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.5427732 0.5884078
## sample estimates:
## cor
## 0.566024
Solution: After running the three pairwise correlations between my variables of SalePrice, 1stFlrSF, and GrLivArea, I can see from the outputs that all three accept the alternative hypothesis that the true correlation is not equal to zero and that there is a positive relationship between these attributes. Therefore, this rejects the null hypothesis that states that the correlations between each pairwise set of variables is zero. I know this to be true since the p-values for all three of my Person’s correlation tests are less than 0.05.
Solution: Additionally, since we are dealing with a large sample size, and our correlation tests are yielding very small p-values, all less than 0.001, I’m not worried about family-wise error when measuring relationships across these three attributes. With p-values this low, it creates a solid argument that we aren’t running into a Type 1 error here.
Linear Algebra and Correlation: Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
Remembering that the inverse of a matrix multiplied by a matrix will become the identity matrix:
\[A^{−1}A=AA^{−1}=I\]
We can find our precision matrix by using the solve() function in baseR to invert our correlation matrix:
prec_matrix <- solve(corr_matrix)
prec_matrix
## 1stFlrSF GrLivArea SalePrice
## 1stFlrSF 1.6795240 -0.4611713 -0.690746
## GrLivArea -0.4611713 2.1352622 -1.233697
## SalePrice -0.6907460 -1.2336974 2.292718
Then, we can multiple the precision matrix by the correlation matrix to confirm that we get the identity matrix:
\[ correlation \ matrix \times \ precision \ matrix = \]
\[\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ \end{bmatrix}\]I_1 <- round(prec_matrix %*% corr_matrix, 2)
I_2 <- round(corr_matrix %*% prec_matrix, 2)
I_1
## 1stFlrSF GrLivArea SalePrice
## 1stFlrSF 1 0 0
## GrLivArea 0 1 0
## SalePrice 0 0 1
I_2
## 1stFlrSF GrLivArea SalePrice
## 1stFlrSF 1 0 0
## GrLivArea 0 1 0
## SalePrice 0 0 1
As we can see above, we do indeed get the Identity Matrix when we multiply the precision matrix with the correlation matrix and vice-versa.
Finally, we will perform LU decomposition on the matrix. Fortunately, R has a package called lu.decomposition() that will find the upper and lower triangle matrices of our precision matrix, \(U\) and \(L\) respectively. The relationship between these two matrices is based on the formula:
\[A = LU\] Where \(A\) = our precision matrix, and \(L\) and \(U\) are the upper and lower triangular matrices of our precision matrix. We can see, when finding \(L\) and \(U\) below, we get these respective matrices:
A <- prec_matrix
L <- lu.decomposition(A)$L
U <- lu.decomposition(A)$U
round(L, 2)
## [,1] [,2] [,3]
## [1,] 1.00 0.00 0
## [2,] -0.27 1.00 0
## [3,] -0.41 -0.71 1
round(U, 2)
## [,1] [,2] [,3]
## [1,] 1.68 -0.46 -0.69
## [2,] 0.00 2.01 -1.42
## [3,] 0.00 0.00 1.00
And when we multiply these together, we can obtain our original precision matrix (\(A\)):
round(A, 2)
## 1stFlrSF GrLivArea SalePrice
## 1stFlrSF 1.68 -0.46 -0.69
## GrLivArea -0.46 2.14 -1.23
## SalePrice -0.69 -1.23 2.29
round(L %*% U, 2)
## [,1] [,2] [,3]
## [1,] 1.68 -0.46 -0.69
## [2,] -0.46 2.14 -1.23
## [3,] -0.69 -1.23 2.29
\[\begin{equation} LU = A, where \space A = \begin{bmatrix} 1.68 & -0.46 & -0.69\\ -0.46 & 2.14 & -1.23\\ -0.69 & -1.23 & 2.29\\ \end{bmatrix} \space = \space \begin{bmatrix} 1.00 & 0.00 & 0.00\\ -0.27 & 1.00 & 0.00\\ -0.41 & -0.71 & 1.00\\ \end{bmatrix} \space \begin{bmatrix} 1.68 & -0.46 & -0.69\\ 0.00 & 2.01 & -1.42\\ 0.00 & 0.00 & 1.00\\ \end{bmatrix} \space = LU \end{equation}\]
Calculus-Based Probability & Statistics: Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
For this exercise, we’ll use the GrLivArea variable from the training dataset. When we plot it below on a histogram we can quickly see that it is skewed to the right:
ggplot(kaggle_df, aes(GrLivArea)) +
geom_histogram(aes(y = ..density..), colour="#555555", fill="#dddddd") +
geom_density(alpha=.6, fill="#333333") +
geom_vline(xintercept = min(kaggle_df$GrLivArea)) +
labs(title = "Above Ground Living Area (square feet)", align='middle') +
theme_minimal()
We can also see that the minimum value is absolutely above zero, so we do not need to perform any shifts. We can double check the data by using the summary() function, and we see that the minimum value is 334 square feet:
summary(kaggle_df$GrLivArea)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1130 1464 1515 1777 5642
After loading the MASS package and running fitdistr() to fit an exponential probability density function to our variable, I’ve found that that the optimal value of \(\lambda\) for this distribution is 0.000659864.
require(MASS)
exp_fit <- fitdistr(kaggle_df$GrLivArea, "exponential")
est_rate <- exp_fit$estimate
est_rate
## rate
## 0.000659864
Using my newly found optimal value for lambda, I then was able to generate a 1000-sample variable from this exponential distribution:
set.seed(123)
pdf <- rexp(1000, est_rate)
pdf_df <- as.data.frame(pdf)
I then plotted this distribution below, along with my original distribution for GrLivArea:
A few takeaways when comparing the distributions:
The original data for GrLivArea appears to be shaped as a more normal distribution, albeit right skewed, the most frequent square footage tends to be around 1500 and 1650 square feet.
Alternatively, the exponential distribution of GrLiveArea appears to be even more right skewed, and takes on more of an exponential shape with a longer tail.
Next, I’ll find the 5th and 95th percentiles using the cumulative distribution function (CDF). Thinking back to the formula from our reading, this function is:
\[F(x) = 1 - e^{-\lambda x}\] Since we already have found \(\lambda = 0.000659864\), we can solve the equation accordingly for the 5th percentile:
\[0.05 = 1 - e^{-0.000659864x}\] \[-0.95 = e^{-0.000659864x}\] \[-ln(0.95) = 0.000659864x\] \[\frac{-ln(0.95)}{0.000659864} = x\] \[77.73313 = x\] Similarly, we can find the 95th percentile by:
\[0.95 = 1 - e^{-0.000659864x}\] \[-0.05 = e^{-0.000659864x}\] \[-ln(0.05) = 0.000659864x\] \[\frac{-ln(0.05)}{0.000659864} = x\] \[4539.92 = x\] We can check these calculations using the qexp() function in R:
qexp(0.05, rate=est_rate)
## [1] 77.73313
qexp(0.95, rate=est_rate)
## [1] 4539.924
It looks like these match the calculations above!
Assuming normality, we can generate a 95% confidence interval from the empirical data by utilizing the following function:
\[\overline{X} \pm Z * \frac{\sigma}{\sqrt{n}}\]
Since we have all of this information available to us, we can solve by utilizing the following:
\(Z\) = 1.96 (standard for a 95% confidence interval)
\(\overline{X}\) = mean value of GrLivArea
\(\sigma\) = standard deviation of GrLivArea mean
\(n\) = sample size of GrLivArea
X_cap <- mean(kaggle_df$GrLivArea)
Z <- 1.96
sigma <- std(kaggle_df$GrLivArea)
n <- length(kaggle_df$GrLivArea)
upper_bound <- X_cap + Z * (sigma / sqrt(n))
lower_bound <- X_cap - Z * (sigma / sqrt(n))
upper_bound
## [1] 1542.419
lower_bound
## [1] 1488.509
### talk about this tomorrow, notice that because original distribution is right skewed (not exactly normal), median would be better approximation here, so assuming normal distribution wouldn't be best here 1515.46 (mean) --> 1464 (median)
### CI [1542 - 1488], median falls outside of this boundary
mean(kaggle_df$GrLivArea)
## [1] 1515.464
median(kaggle_df$GrLivArea)
## [1] 1464
quantile(kaggle_df$GrLivArea, c(0.05, 0.95))
## 5% 95%
## 848.0 2466.1